# install new packages IN THE CONSOLE!
install.packages("pacman")Data import
Reading in data files
Wiederholung
Last week we…
- learned the difference between numerical and categorical data types
- produced our first plots with
ggplot2 - learned which plots are appropriate for different data types
Heutige Ziele
Today we will…
- learn how to eyeball a new data set
- learn how to import different data types
- learn how to enter data by-hand
- visualise a new dataset
0.1 Lust auf mehr?
1 Exploring data
- last week we worked with data provided in the R package
palmerpenguings- data from packages are a great way to learn data science tools
- but you’ll want to work with your own data at some point
- today we’ll learn the basics of reading data files into R
1.1 Packages
- we will start using the
pacmanpackage- the function
p_load()takes package names as arguments - it then checks if you have the package install
- if yes, then it loads the package
- if not, then it installs and loads the package
- the function
- this saves us from having to install new packages every time
# load packages
pacman::p_load("tidyverse", # wrangling
"janitor", # wrangling
"here", # relative file paths
"patchwork" # plot layout
)- we now have
tidyverseloaded, and the new packagesjanitorandhereinstalled and loaded
1.2 Built-in data
- let’s first explore some built-in data: the
irisdataset- how many variables are there?
- how many observations?
# load tidyverse
library(tidyverse)
# read-in iris dataset
data("iris")1.3 Explore a dataset
View(): open dataset- only run this in the console! Your document won’t render if you have it in your script
- the
head()function prints the first 6 rows of the data
# print dataset (first 6 rows)
head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
1.3.1 glimpse()
- from the
tibblepackage - gives a sideways preview of the dataframe
glimpse(iris)Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
1.3.2 summary()
- prints a summary of each variable
- minimum, maximum, mean of numerical variables
- number of observations per level of a categorical variable
table1
Example 1
- Try loading the built-in dataset
table1 - Explore the dataset using the functions we just learned
- Is there anything funny with the summary? How might we fix that?
2 Importing data
- we usually want to work with our own data, not built-in toy data
- we will focus only on rectangular data (i.e., tidy data)
- there are many different file types that data can take, e.g., .xlsx, .txt, .csv, .tsv
- try to avoid
.xlsx; if you have an Excel dataset try saving it as a .csv before reading it into R
- try to avoid
csvis the most common data file type: Comma Separated Values- this is what a simple CSV file looks like
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
the first row (the “header row”) contains the columns names
the subsequent rows contain the data
how many variables are there? how many observations?
2.1 readr package
- the same data can be viewed as a table, just like we did with
penguinsandiris- but first we have to read in the data
- the
readrpackage (part oftidyverse) can load in most data types
read_csv(here::here("daten", "students.csv"))| Student ID | Full Name | favourite.food | mealPlan | AGE |
|---|---|---|---|---|
| 1 | Sunil Huffmann | Strawberry yoghurt | Lunch only | 4 |
| 2 | Barclay Lynn | French fries | Lunch only | 5 |
| 3 | Jayendra Lyne | N/A | Breakfast and lunch | 7 |
| 4 | Leon Rossini | Anchovies | Lunch only | NA |
| 5 | Chidiegwu Dunkel | Pizza | Breakfast and lunch | five |
| 6 | Güvenç Attila | Ice cream | Lunch only | 6 |
table1
Example 2
- Import the
students.csvdataset and save it as an object calleddf_studentsdf_is short for DataFrame; it’s a good idea to use a prefix before object names so we know what each object contains
- When importing data with
read_csv, some information is printed in the Console. What is printed? - Explore the dataset using the functions we just learned
- Do you see anything odd?
2.2 The here package
- how did R know exactly where to find the
datenfolder? - our working directory is set to the location of our RProject within our computer
- whenever we want to access data in our RProject, we should use
here::here() - to see where
here()is starting from, runhere()
here()[1] "/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten"
- this will look different for all of our machines
- but what should be the same is our folder structure within our projects (e.g.,
daten/students.csv)
- but what should be the same is our folder structure within our projects (e.g.,
here Paket
Before the here package, we used to have to explicitly tell R where on our computer a file was located (e.g., /Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten/daten/students.csv), or use the setwd() (set Working Directory) function to tell R where to assume where all files are located (e.g., setwd(/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten)). Luckily, you never need to use these absolute file paths or setwd()!
From the here package documentation:
The goal of the here package is to enable easy file referencing in project-oriented workflows. In contrast to using
setwd(), which is fragile and dependent on the way you organize your files, here uses the top-level directory of a project to easily build paths to files.
This means we now have the huge benefit of being able to move our project folder anywhere, and our file path will still be relative to wherever we’ve moved our project fold. This mean the project runs independent of where on your computer it is located. You can also send somebody the project folder, and everything should run on their machine!
2.3 Missing values
- data transformation refers to ‘fixing’ our data when it’s not ‘tidy’
- in our
df_studentsdataframe, you might’ve noticed someNAorN/AvaluesN/Awas written as text, and so R reads it as suchNAs in R refer to missing data (“Not Available”)
- last week we saw some warning messages when we created our scatterplots
- these warnings were telling us about missing values (
NAs) that were not plotted
- these warnings were telling us about missing values (
- true missing values are completely empty, so having
N/Awritten in ourdf_studentsdata is not actually read as a missing value - to fix this, we can use the argument
na =for theread_csv()function- this argument tells
read_csv()which values it should equate with missing values
- this argument tells
df_students <- read_csv(here::here("daten", "students.csv"),
na = "N/A")
head(df_students)# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only "4"
2 2 Barclay Lynn French fries Lunch only "5"
3 3 Jayendra Lyne <NA> Breakfast and lunch "7"
4 4 Leon Rossini Anchovies Lunch only ""
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch "five"
6 6 Güvenç Attila Ice cream Lunch only "6"
- now the value that was formerly “N/A” is read as an
NA- but what about the empty cell?
- we have now overwritten
read_csv()reading empty cells asNA- how can we tell
read_csv()to read more than one type of input asNA? - i.e., we want to tell it to read “” and “N/A” as
NA
- how can we tell
df_students <- read_csv(here::here("daten", "students.csv"),
na = c("N/A",""))
head(df_students)# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
2.4 Column names
- if we print
df_studentsin the Console we’ll see that the first two columns names are surrounded by backticks (e.g.,`Student ID`)- this is because they contain an empty space, which is not syntactically valid (variable names need to start with a letter and not contain spaces or special characters)
- a quick fix is to the function
clean_names()from thejanitorpackage
janitor::clean_names(df_students)# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
- that looks better!
- but if you now run
head(df_students), do you see the cleaned column names?
- but if you now run
- when we pass an object through a function, the object is not ‘updated’
- so we have to again assign the object
df_students <- janitor::clean_names(df_students)- but we often know that we want to run multiple functions (
read_csv(),clean_names()) on the same object- we can do that using pipes
2.5 Pipes
- pipes are placed at the end of function call when the result of this function should be passed through a subsequent function
- they can be read as “and then…”
read_csv(here::here("daten", "students.csv")) %>%
head()# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
- there are currently 2 pipes that can be used in R
- the
magrittrpackage pipe:%>% - the new native R pipe:
|>
- the
- so far I haven’t discovered any major difference between the two, but for now I’m sticking with
%>%
Example 3
- Load the
students.csvdataset again with fixedNAs and then- Use a pipe to call
clean_names()on the dataset, and then - call the
head()function
- Use a pipe to call
- Load the
students.csvdataset again with fixedNAs, saving it as the objectdf_students, and then- use a pipe to call
clean_names()on the data set
- use a pipe to call
- Why shouldn’t you use a pipe and the
head()function when you’re saving the dataset as an object?
2.6 Other file types
- once you’re comfortable with
read_csv(),readr’s other functions are easy to use- you just have to know when to use which ones
read_csv2()reads semicolon-separated files- These use
;instead of,to separate fields and are common in countries that use,as the decimal marker
- These use
read_tsv()reads tab-delimited filesread_delim()reads in files with any delimiter- will try to guess the delimiter unless you specify it with the argument
delim =(e.g.,read_delim(students.csv, delim = ","))
- will try to guess the delimiter unless you specify it with the argument
Others I haven’t yet needed:
read_fwf()reads fixed-width filesread_table()reads a common variation of fixed-width files where columns are separated by white spaceread_log()reads Apache-style log files
Example 4
- What function would you use to read a file where fields were separated with “
|”? - What arguments do
read_csv()andread_tsv()have in common? - How would you load in a dataset with a semicolon as delimiter?
- Load in the dataset
nettle_1999_climate.csv- how many variables are there?
- Load in the dataset
nettle_1999_climate2.csv- how many variables are there? Is this right?
- Load in the dataset
nettle_1999_climate3.csv- how many variables are there? Is this right?
3 Data entry
- when collecting small amounts of data, you may want to enter them by hand into R
- there are two useful functions that help us take collected data and create a
tibble tibbles are modern dataframes, don’t worry about the definition of a tibble just yet
- there are two useful functions that help us take collected data and create a
- let’s collect everybody’s initials, heights (cm), and birthday (
ddmm)
i <- "DP"
h <- 171
b <- 07053.1 tibble()
tibble(
initial = i,
height = h,
bday = b)# A tibble: 1 × 3
initial height bday
<chr> <dbl> <dbl>
1 DP 171 705
3.2 tribble()
- it might be easier to enter data row-by-row
- this is possible with a transposed tibble (
tribble)
- this is possible with a transposed tibble (
tribble(
~initial, ~height, ~month, ~day,
"DP", 171, 07, 05
)# A tibble: 1 × 4
initial height month day
<chr> <dbl> <dbl> <dbl>
1 DP 171 7 5
Example 5
- Save the tibble (i.e., data frame) as the object
df_wir - Explore the dataset (e.g., print summaries)
- What seems strange to you?
initial height month day
Length:1 Min. :171 Min. :5 Min. :7
Class :character 1st Qu.:171 1st Qu.:5 1st Qu.:7
Mode :character Median :171 Median :5 Median :7
Mean :171 Mean :5 Mean :7
3rd Qu.:171 3rd Qu.:5 3rd Qu.:7
Max. :171 Max. :5 Max. :7
4 Working with data
4.1 Variable types
readrguesses the type of data each column contains- the major columns types to know are
numericalandfactor
- the major columns types to know are
factors contain categories or groups of data, but can sometimes look likenumericaldata- for example, our column
monthcontains numbers, but it could also contain the name of each month - a good way to know which is which: it makes sense to calculate the mean of a
numericalvariable, but not of afactor- it makes sense to calculate our mean height, but not our mean birth month
- for example, our column
df_wir$month <- as_factor(df_wir$month)4.2 Indexing
- sometimes we want to access a certain variable (column) in a data frame
- using base R, we do that with
$:dataframe$variable
- using base R, we do that with
df_wir$height[1] 171
- and we can use this as the argument of a function
- try finding the
minimum andmaximum heights in our group - calculate the
sumof our heights
- try finding the
Example 6
- Convert
df_wir$monthto a factor - Compute our mean height using the function
mean()and indexing - Compute the sum of our heights.
4.3 Write to file
- we can also save our dataframe so that we can come back to it later with
write_csv(object, "filename") - be very careful! If you use an existing filename you can overwrite a dataset you already have
write_csv(df_wir, file = here("daten", "wir.csv"))5 Exercises
Here are some more in-depth exercises.
5.1 Import/export
- Load in the built-in
starwarsdataset, that contains information about Star Wars characters. - Export the data as a
csvfile calledstarwarsin yourdatenfolder. - Import the
starwars.csvdata file usingread_csv()
5.2 Convert column types
- Convert the following variables to factors:
hair_colorskin_coloreye_colorsexgenderhomeworldspecies
5.3 Plots
Produce the following three plots, and briefly describe what they show and any conclusions that can be drawn from them.
Create another plot of your choosing from the
starwarsdataset. Add it to the plot grid (you’ll have to adjust the syntax). Describe what it shows.
(fig_sw_height + fig_sw_gender) /
fig_sw_height_mass +
plot_layout(nrow = 2, heights = c(.4,.6))Heutige Ziele 🏁
Heute haben wir…
- learned how to eyeball a new data set ✅
- learned how to import different data types ✅
- learned how to enter data by-hand ✅
- visualised a new dataset ✅
Session Info
Hergestellt mit R version 4.2.3 (2023-03-15) (Shortstop Beagle) und RStudioversion 2023.3.0.386 (Cherry Blossom).
sessionInfo()R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.1.2 here_1.0.1 janitor_2.2.0 lubridate_1.9.2
[5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1
[9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 compiler_4.2.3 pillar_1.9.0 tools_4.2.3
[5] bit_4.0.5 digest_0.6.31 timechange_0.2.0 jsonlite_1.8.4
[9] evaluate_0.20 lifecycle_1.0.3 gtable_0.3.3 pkgconfig_2.0.3
[13] rlang_1.1.0 cli_3.6.1 rstudioapi_0.14 parallel_4.2.3
[17] yaml_2.3.7 xfun_0.38 fastmap_1.1.1 withr_2.5.0
[21] knitr_1.42 generics_0.1.3 vctrs_0.6.1 htmlwidgets_1.6.2
[25] hms_1.1.3 bit64_4.0.5 rprojroot_2.0.3 grid_4.2.3
[29] tidyselect_1.2.0 snakecase_0.11.0 glue_1.6.2 R6_2.5.1
[33] fansi_1.0.4 vroom_1.6.1 rmarkdown_2.21 pacman_0.5.1
[37] farver_2.1.1 tzdb_0.3.0 magrittr_2.0.3 scales_1.2.1
[41] htmltools_0.5.5 colorspace_2.1-0 labeling_0.4.2 utf8_1.2.3
[45] stringi_1.7.12 munsell_0.5.0 crayon_1.5.2